Void Finding

By Mairead Heiger for AST1430: Cosmology at the University of Toronto

Methods

The general steps in the void finding algorithm are as follows:

Given input data points D = {D_x, D_y} = {(x_1, y_1), (x_2, y_2)...(x_n, y_n)}, calculate the Voronoi diagram V.

Calculate the density (rho = 1 / area_C) at every point in D with cell C in V.

Given an MxM grid, interpolate the density at every point p = (i,j).
    If interpolation method is nearest neighbor:
        density_ij = rho(closest point in D to p)
    If interpolation method is distance weighted:
        choose _k_
        density_ij = mean(rho(closest _k_ points in D to p))
    If interpolation method is Voronoi neighbors:
        Insert p into V with cell C_p.
        For all n cells in V:
            If C_p and and C_n share a wall:
                point D_n in cell C_n is a neighbor of p
        density_ij = mean(density of neighbors)

Smooth density with Gaussian blur.

Calculate a threshold density value.

For each grid point (i,j) in density:
    If density_ij < threshold:
        (i,j) is a void.
    Else:
        (i,j) is not a void.


Return to top ↑

3 interpolation methods

The data points need to be converted into a density map in order to find the voids. This requires interpolation between the points, which is done using the Voronoi diagram of the data points.

The biggest challenge was the (mis)identification of "trivial" voids. There is no set definition of what constitutes a void: how empty is empty enough? Do they have to be a certain size? Without specifying such cut-offs, different methods of interpolation and binarization yielded more voids than others. Within the test data, they all find the actual intended void to some extent, and some find the space between points as voids and others don't. I tried 3 methods of increasing complexity with varying success.

  1. #### Simplest: Nearest neighbors. Assign each grid point the same density as the data point closest to it. Since a Voronoi diagram is constructed such that all points within a Voronoi cell are closer to the 'seed' data point than any other, this method produces a density map that is a poorly resolved Voronoi diagram. This method identifies the void but also identifies a considerable number of 'trivial' voids that are essentially noise in the density map. It improves considerably after smoothing.

  2. #### Moderate: Distance weighting. Assign each grid point the average density of the k closest points. The density map is smoother with this method than with the nearest neighbor interpolation, but it struggles at the edges of voids. Grid points that are squarely in the void but at the edges pull their density primarily from data in high density areas outside of the void, artificially raising their density. It essentially washes out the density map and makes it harder to identify the true size of the void. It's quite ineffective overall and not very functional.

  3. #### Most complex: Voronoi Neighbors. Assign each grid point the average density of its Voronoi neighbors. In a Voronoi diagram, a seed point's neighbors are the points whose cells border its own. In this case, the Voronoi neighbors of the grid points are the original data points whose cells border the grid point if the grid point is included in the Voronoi diagram. This method is based on natural neighbor interpolation, which several void finding algorithms use, but is much simpler. Despite being much simpler, it's still very slow because it recalculates a Voronoi diagram for every grid point and then searches through the entire diagram for neighbors. Regardless, it seems to be the most effective at producing a smooth density map and identifying large voids without picking up as much noise. After smoothing, however, the performance of the nearest neighbor method is very similar.

With one void, the nearest neighbor method improves considerably with Gaussian smoothing, as do the other two (but not as much). Even with blurring, the distance weighted method is still not very good. Blurring also washes out small voids, which is a particular problem with the Voronoi neighbors method with a coarse grid.

Smoothing

After interpolation, the density map is smoothed. Depending on the interpolation method, the resulting map is very noisy, which Gaussian smoothing/blurring reduces. The noise is liable to be detected as a void, so smoothing helps minimize detection of trivial voids and non-voids without sacrificing the detection of larger voids, since they are on a larger scale. There's a risk that it will wash small voids out completely, however. Smaller scale smoothing (smaller $\sigma$) is less liable to do so. Smoothing at the same scale would not be appropriate for every dataset, so the algorithm returns results both with and without smoothing, and can always be applied with more fine tuning on the unsmoothed result.

Gaussian smoothing is done with a built-in sci-kit filter: https://scikit-image.org/docs/stable/api/skimage.filters.html#skimage.filters.gaussian

Thresholding

After smoothing, the final density map is ready for a threshold analysis! This is a binarization: all points with densities below a certain threshold are voids; above the threshold, they are not voids. As with the interpolation methods, the threshold changes what is considered a void and what isn't. Some methods are very liable to find many trivial voids, or to interpret noise as voids. The binarization is calculated using ski-kit filters as well: https://scikit-image.org/docs/stable/auto_examples/applications/plot_thresholding.html.

The final result is a black and white image, where black areas are voids.

Return to top ↑

Next steps

This algorithm, though simple, does separate areas in a plane with no galaxies/points from areas with them, with some margin. Overall, for large voids, there isn't (by eye) a large difference between the smooothed nearest neighbor method and the Voronoi neighbor method. The distance weighted method never performs well, regardless of the size, shape, or number of voids, how much smoothing is or isn't applied, or how coarse/fine the grid is (although a coarse grid actually does better). It's included in all tests, but ultimately, it's not functional. For the test data with smaller voids, a finely sampled grid is necessary to avoid washing out the small voids, and the Voronoi neighbor method better captures the true shape of the voids. Without knowing what should or shouldn't count as a void in real data, the Voronoi neighbor method is likely the more accurate of the two at distinguishing voids and their shape, but it's much more computationally intensive. Searching the entire grid for neighbors is prohibitive, especially for a large dataset or a very fine grid. It could absolutely be coded more efficiently (only searching nearby for neighbors, inserting multiple points far away from each other and searching once for all of them, etc.), but in its current form, it's slow. The smoothed nearest neighbor method is a good compromise between efficiency and accuracy.

The thresholding method is currently a case-by-case determination, but two methods stand out. Otsu binarization uses a the threshold that maximizes variance between background (void) and foreground. It is the threshold often used in image processing as a step in a watershed algorithm, which several void finders use. It works well on the test data and is shown below. Mean thresholding, where the threshold is the mean value of the image, identifies voids well in test data, but captures noise. However, because we cannot definitively say what is as a void and what is noise/non-void space between structures is, identifying more trivial voids may be preferable to missing them, and mean thresholding is very simple.

Going forward, this algorithm could be applied to larger real datasets. In the simplest test case done here, a void has a true density of zero (no galaxies/points). However, from observational data, we know that voids are not truly empty--a more physical void would be better described as having an extremely low density (whatever 'extreme' may be). This algorithm is flexible enough to still apply, because although the true (test) data assumes that a void has zero density, that assumption is not passed on to the algorithm. It would separate non-zero, low density areas from higher density areas, and does not require prior selection of the characteristics of the void. As with other void finding algorithms, it could be applied to datasets of galaxies at varying redshifts to analyze the evolution of structure over time. It could also be used to analyze the size of voids and compare them to predictions for different combinations of cosmological parameters, notably dark matter and energy density, and theories of modified gravity.

Currently, the best the algorithm can do is identify 'void' and 'not void' in broad strokes, but the cosmic web is more complex than just those two structures. A binarization will always return 'void'/'not void', but its accuracy is in part due to the assumption that the galaxies are points and the density of space is related only to how close points are to other points (which is reflected in the area of the Voronoi cells). Although the area of the cell is strongly related to the local physical density, it is not a mass density and has flaws. For example, if a very massive galaxy is in a filament rather than a cluster, this algorithm may misrepresent the size of the void. A true mass density (where the area is based on the size of the galaxy, not the Voronoi cell, or some combination of the two) has clear physical meaning and might account in part for the fact that the cosmic web is comprised of filamentary and wall-like structures as well as clusters and voids.

The algorithm could also be improved, likely considerably, by a better thresholding method. Some exisiting void finding algorithms use watershed methods, for which binarization is one step in the final processing. Another good option is automatic thresholding. Automatic thresholding calculates a threshold from an image, separates it into foreground/background, then calculates the threshold from the foreground/background individually, and, if the average of the two is below a chosen value, that threshold is applied to the original image, and the process repeats. There are also some other buggy things about the way the code is written--right now, the algorithm considers the input data to be a square, so the non-square SDSS data has a unphysically low density border that changes the threshold value.

Return to top ↑

The actual code!

Test Data

Test the void finding algorithm on randomly generated points with a single obvious void

Return to top ↑

Calculate the density maps

If you re-run it, just a heads up that it's slow! The Voronoi neighbor interpolation is crazy inefficient (takes 20+ mins on my laptop for the 100x100 grid...). A coarse 20x20 grid captures the very broad strokes and takes a more reasonable amount of time)

Return to top ↑

Plot the density maps

Return to top ↑

Thresholding

Return to top ↑

Compare thresholds

Multiple voids

Test the code now on toy data with multiple smaller voids.

Return to top ↑

Density maps

Return to top ↑

Thresholding

Return to top ↑

SDSS Data

Now that we know the algorithm has mediocre to bad performance on multiple voids, let's try it on some actual data!

Return to top ↑

Density maps

Return to top ↑

Thresholding

Return to top ↑

Return to top ↑